Reliability and validity of a widely-available AI tool for assessment of stress based on speech

Cigna’s online stress management toolkit includes an AI-based tool that purports to evaluate a person’s psychological stress level based on analysis of their speech, the Cigna StressWaves Test (CSWT). In this study, we evaluate the claim that the CSWT is a “clinical grade” tool via an independent validation. The results suggest that the CSWT is not repeatable and has poor convergent validity; the public availability of the CSWT despite insufficient validation data highlights concerns regarding premature deployment of digital health tools for stress and anxiety management.


Discussion
The CSWT is presented as a clinical grade tool and offered as a part of a broader stress management toolkit. The results herein fail to support the claim of clinical grade performance and raise questions as to whether the tool is effective at all. This external validation study found that the CSWT has poor test-retest reliability and poor validity. The convergent validity results suggest that the CSWT has limited agreement with the PSS. Even when both test administration results were used to predict the PSS using linear regression, the model explained only 6.9% of the variance in the PSS. Our findings align with previously highlighted concerns that widespread adoption of AI technologies is being prioritized over ensuring that the devices work 12. The widespread availability of this tool for stress and anxiety management, particularly through a large insurance company, may lead users to rely on it for assessing psychological stress levels and making healthcare decisions. As a result, misleading or inaccurate results can contribute to a variety of negative consequences, such as inappropriate treatment, wasted resources, increased anxiety, or false reassurance. Additionally, the CSWT's interpretations of a respondent's results are not limited to the state psychological stress (acute, transient) that the respondent may be feeling at the time they complete the test; rather, the interpretations extend to trait psychological stress (e.g., "you're under a balanced level of pressure day-to-day"). Extrapolating trait psychological stress from a single 1-minute speech sample is unlikely to be feasible, even if the CSWT scores were valid and reliable in assessing state psychological stress.
The results of this study serve as an example of the fallacy of AI functionality 19, where companies deploy AI tools under the assumption that they work but without requisite validation data. In healthcare, the mechanisms for verifying claims about a device's functionality are well established 20,21. Online digital health tools should not be exempt from this level of scrutiny. Any deployed digital health tools should be grounded in verifiable claims with published evidence of functionality. In the absence of such data, these tools should not be made widely available.
The results of this study further highlight the previously documented challenges associated with building speech-based measures of health 22. The within-subject and between-subject variability associated with speech production makes robust cross-sectional prediction challenging. The lack of transparency with the CSWT (in terms of validation data, functionality, and contact information) also makes it difficult to evaluate model quality. While information regarding the model underlying the CSWT is not public (i.e., what acoustic and semantic features are used), the most common approach to building clinical speech models is supervised learning 23, in which developers train high-dimensional models to predict a clinical variable of interest. It has been documented that models trained under this paradigm are less likely to generalize 22,23, which can be partially attributed to the variability of commonly used features in the clinical speech literature 24. We posit that feature variability imposes inherent limits on any algorithm's ability to accurately predict complex health constructs (e.g., psychological stress, depression, anxiety) directly from speech. It is important to note that this limitation cannot be overcome by collecting more training data or using more complex models, as it is a property of the variability associated with human speech production.

Method

Participants
Our study included 60 participants over the age of 18, recruited at Arizona State University. The research was approved by the institutional review board of Arizona State University (IRB #00016588). The methods were carried out in accordance with the approved IRB protocol, and informed consent was collected from all participants via an online form prior to the start of the experiment. The inclusion criteria for the study were broad: participants had to speak English and be over the age of 18. The Cigna StressWaves website indicates that the device can be used by all English speakers, even if English is not their primary language 13.

Test setting
All participants used the same equipment (i.e., Logitech H390 Wired Headset connected to a Dell computer) and conducted the experiment in a quiet laboratory environment. Participants were not shown their CSWT stress scores.

The Cigna StressWaves test
The CSWT is presented as a clinical-grade tool for assessing a patient's psychological stress level based on analysis of their speech. The user is prompted to select a question and provide a response lasting at least 60 seconds.
In this study, we asked participants to perform the test twice to evaluate test-retest reliability. Each participant responded to one of the eight prompts on two consecutive administrations of the test during the same session (all sessions lasted 10 min or less). The participant was able to freely choose any of the eight prompts for each of the two administrations. Only one participant chose the same prompt twice. The tool provides an ordinal scale output (i.e., low, moderate, or high) and a full-scale score presented on a gradient scale. Each participant also completed the 10-question PSS. The PSS is also scored numerically on a full scale and on a three-level ordinal scale (i.e., numerical range from 0 to 40; low, moderate, and high) 25. The order of PSS and CSWT was randomized across participants.
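As an illustration of how the two PSS outputs relate, the following is a minimal sketch of PSS-10 scoring. It assumes the standard convention that items 4, 5, 7, and 8 are reverse-scored and the commonly used cutoffs of 0 to 13 (low), 14 to 26 (moderate), and 27 to 40 (high); the function names are hypothetical, and the study does not describe its scoring code.

```python
# Sketch of PSS-10 scoring. Assumptions (not stated in the study):
# items 4, 5, 7, 8 are reverse-scored; cutoffs are 0-13 / 14-26 / 27-40.
REVERSED_ITEMS = {4, 5, 7, 8}  # 1-based item numbers


def pss10_score(responses):
    """Return the 0-40 total from ten item responses, each coded 0-4."""
    if len(responses) != 10 or any(r not in range(5) for r in responses):
        raise ValueError("expected ten responses, each an integer from 0 to 4")
    return sum(
        (4 - r) if item in REVERSED_ITEMS else r
        for item, r in enumerate(responses, start=1)
    )


def pss10_category(total):
    """Map a 0-40 total onto the three-level ordinal scale."""
    if not 0 <= total <= 40:
        raise ValueError("total must be between 0 and 40")
    if total <= 13:
        return "low"
    if total <= 26:
        return "moderate"
    return "high"


# Hypothetical respondent (illustrative values only, not study data).
example = [2, 1, 3, 1, 2, 2, 1, 3, 2, 1]
total = pss10_score(example)
print(total, pss10_category(total))
```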

Statistical analysis
The primary analysis in the study is the test-retest reliability, measured via the intra-class correlation (ICC) between the first and second administration of the CSWT. The secondary analysis is the evaluation of validity of the CSWT relative to the PSS, measured via the correlation between the PSS score and the average of the two CSWT scores. We average the scores between the two administrations to reduce CSWT variability. We use the PSS as a comparison because it produces a full-scale score on the same range as the CSWT. Both tests also provide ordinal ratings (low, moderate, high). For the ordinal ratings, we use Cohen's Kappa to assess repeatability of the ratings and validity relative to the PSS. Statistical analyses were conducted using RStudio with the irr package 26.
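For readers who want to see the two agreement statistics concretely, the sketch below computes an ICC and Cohen's Kappa in plain Python. It is not the analysis code used in the study, which was run in R with the irr package. The ICC form is an assumption (two-way random effects, absolute agreement, single measurement, often written ICC(2,1)), since the text does not specify one, and the Kappa is unweighted. All data values are made up for illustration.

```python
def icc_2_1(scores):
    """ICC(2,1) for a list of rows, one row of k scores per subject.

    Assumed form: two-way random effects, absolute agreement, single
    measurement. Computed from the two-way ANOVA mean squares.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between administrations
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def cohen_kappa(first, second):
    """Unweighted Cohen's Kappa for two equal-length lists of labels."""
    n = len(first)
    categories = set(first) | set(second)
    observed = sum(a == b for a, b in zip(first, second)) / n
    expected = sum(
        (first.count(c) / n) * (second.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Hypothetical test-retest scores (administration 1, administration 2).
retest = [[12, 25], [30, 14], [18, 20], [22, 9], [7, 27]]
print("ICC(2,1):", round(icc_2_1(retest), 3))

# Hypothetical ordinal ratings from two administrations.
first = ["low", "moderate", "high", "moderate", "low"]
second = ["moderate", "moderate", "low", "high", "low"]
print("Kappa:", round(cohen_kappa(first, second), 3))
```

An ICC near 1 indicates that repeated administrations rank and score subjects consistently; values near or below 0, as in the made-up retest data above, indicate that within-subject differences are as large as between-subject differences.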

Power analysis
Sample size estimates are based on the primary analysis (test-retest reliability) using the method in 27. We assume an expected ICC reliability of 0.75, per the definition of a clinical-grade test 18. We set our threshold for acceptable ICC at the moderate level of 0.5. We use this lower threshold as a criterion because this is a novel test that relies on speech. Acoustic speech features inherently exhibit considerable variability, which we consider when establishing the lower performance benchmark 24. For a significance level of 0.05 and a power of 80%, the required sample size is 55 subjects. We add an additional 5 subjects to account for potential dropouts, missing data, or issues during data collection. For the secondary analysis, a sample size of 55 subjects allows us to detect a correlation of at least 0.33 between the CSWT and PSS for a significance level of 0.05 and a power of 80% 28.
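The sensitivity figure for the secondary analysis can be approximated with the Fisher z-transformation, under which the smallest detectable correlation is tanh((z_alpha + z_beta) / sqrt(n - 3)). The sketch below is an illustration of that approximation, not the authors' computation (which follows reference 28). It assumes a one-sided test, which the text does not state; under that assumption the approximation gives roughly 0.33 for 55 subjects, while a two-sided test gives roughly 0.37. The function name is hypothetical.

```python
from math import sqrt, tanh
from statistics import NormalDist


def min_detectable_r(n, alpha=0.05, power=0.80, two_sided=False):
    """Smallest correlation detectable with n subjects (Fisher z approximation).

    Assumption: the test is one-sided unless two_sided=True.
    """
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = std_normal.inv_cdf(1 - tail)   # critical value for significance
    z_beta = std_normal.inv_cdf(power)       # quantile for the desired power
    # On the Fisher z scale, the standard error of a correlation is 1/sqrt(n - 3).
    return tanh((z_alpha + z_beta) / sqrt(n - 3))


print("one-sided:", round(min_detectable_r(55), 2))
print("two-sided:", round(min_detectable_r(55, two_sided=True), 2))
```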

Figure 1. The test-retest plot for the Cigna StressWaves test. Each pair of samples was measured during the same session. The intra-class correlation of the test is ICC = −0.106, p > 0.05.

Figure 2. Convergent validity plot for the Cigna StressWaves test relative to the Perceived Stress Scale. The correlation between the two scores is r = 0.200, p > 0.05.

Table 1. Descriptive statistics of the study sample (N = 60, 36 F, 24 M). The PSS and the CSWT provide both continuous and ordinal outputs. The mean and standard deviation correspond to the continuous output, whereas the range, median, and mode correspond to the ordinal outputs.